The Ford GoBike System Dataset is a valuable resource for understanding the usage patterns and trends of a bike-sharing system. It contains data related to the Ford GoBike system, which is a bike-sharing service that provides a convenient and sustainable transportation option for people in various communities. The dataset includes information on the number of trips taken, trip distances, trip durations, and user demographics, among other things. Analyzing this data can help decision-makers understand how the system is being used, identify areas for improvement, and make data-driven decisions to meet the needs of users. Whether you are a researcher, a transportation planner, or simply someone interested in the use of bike-sharing systems, the Ford GoBike System Dataset provides a wealth of information for exploring and understanding this important transportation option.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from haversine import Unit
import haversine as hs
import plotly.io as pio
pio.renderers.default = "notebook_connected"
import plotly.offline as ofl
ofl.init_notebook_mode()
import plotly.graph_objects as go
%matplotlib inline
Load in your dataset and describe its properties through the questions below. Try and motivate your exploration goals through this section.
# read data file
df =pd.read_csv('201902-fordgobike-tripdata.csv')
df
| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52185 | 2019-02-28 17:32:10.1450 | 2019-03-01 08:01:55.9750 | 21.0 | Montgomery St BART Station (Market St at 2nd St) | 37.789625 | -122.400811 | 13.0 | Commercial St at Montgomery St | 37.794231 | -122.402923 | 4902 | Customer | 1984.0 | Male | No |
| 1 | 42521 | 2019-02-28 18:53:21.7890 | 2019-03-01 06:42:03.0560 | 23.0 | The Embarcadero at Steuart St | 37.791464 | -122.391034 | 81.0 | Berry St at 4th St | 37.775880 | -122.393170 | 2535 | Customer | NaN | NaN | No |
| 2 | 61854 | 2019-02-28 12:13:13.2180 | 2019-03-01 05:24:08.1460 | 86.0 | Market St at Dolores St | 37.769305 | -122.426826 | 3.0 | Powell St BART Station (Market St at 4th St) | 37.786375 | -122.404904 | 5905 | Customer | 1972.0 | Male | No |
| 3 | 36490 | 2019-02-28 17:54:26.0100 | 2019-03-01 04:02:36.8420 | 375.0 | Grove St at Masonic Ave | 37.774836 | -122.446546 | 70.0 | Central Ave at Fell St | 37.773311 | -122.444293 | 6638 | Subscriber | 1989.0 | Other | No |
| 4 | 1585 | 2019-02-28 23:54:18.5490 | 2019-03-01 00:20:44.0740 | 7.0 | Frank H Ogawa Plaza | 37.804562 | -122.271738 | 222.0 | 10th Ave at E 15th St | 37.792714 | -122.248780 | 4898 | Subscriber | 1974.0 | Male | Yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 183407 | 480 | 2019-02-01 00:04:49.7240 | 2019-02-01 00:12:50.0340 | 27.0 | Beale St at Harrison St | 37.788059 | -122.391865 | 324.0 | Union Square (Powell St at Post St) | 37.788300 | -122.408531 | 4832 | Subscriber | 1996.0 | Male | No |
| 183408 | 313 | 2019-02-01 00:05:34.7440 | 2019-02-01 00:10:48.5020 | 21.0 | Montgomery St BART Station (Market St at 2nd St) | 37.789625 | -122.400811 | 66.0 | 3rd St at Townsend St | 37.778742 | -122.392741 | 4960 | Subscriber | 1984.0 | Male | No |
| 183409 | 141 | 2019-02-01 00:06:05.5490 | 2019-02-01 00:08:27.2200 | 278.0 | The Alameda at Bush St | 37.331932 | -121.904888 | 277.0 | Morrison Ave at Julian St | 37.333658 | -121.908586 | 3824 | Subscriber | 1990.0 | Male | Yes |
| 183410 | 139 | 2019-02-01 00:05:34.3600 | 2019-02-01 00:07:54.2870 | 220.0 | San Pablo Ave at MLK Jr Way | 37.811351 | -122.273422 | 216.0 | San Pablo Ave at 27th St | 37.817827 | -122.275698 | 5095 | Subscriber | 1988.0 | Male | No |
| 183411 | 271 | 2019-02-01 00:00:20.6360 | 2019-02-01 00:04:52.0580 | 24.0 | Spear St at Folsom St | 37.789677 | -122.390428 | 37.0 | 2nd St at Folsom St | 37.785000 | -122.395936 | 1057 | Subscriber | 1989.0 | Male | No |
183412 rows × 16 columns
#display data types and non-null
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183412 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype
---  ------                   --------------   -----
 0   duration_sec             183412 non-null  int64
 1   start_time               183412 non-null  object
 2   end_time                 183412 non-null  object
 3   start_station_id         183215 non-null  float64
 4   start_station_name       183215 non-null  object
 5   start_station_latitude   183412 non-null  float64
 6   start_station_longitude  183412 non-null  float64
 7   end_station_id           183215 non-null  float64
 8   end_station_name         183215 non-null  object
 9   end_station_latitude     183412 non-null  float64
 10  end_station_longitude    183412 non-null  float64
 11  bike_id                  183412 non-null  int64
 12  user_type                183412 non-null  object
 13  member_birth_year        175147 non-null  float64
 14  member_gender            175147 non-null  object
 15  bike_share_for_all_trip  183412 non-null  object
dtypes: float64(7), int64(2), object(7)
memory usage: 22.4+ MB
df.shape
(183412, 16)
# display summary statistics (mean, std, quartiles) for the numeric columns
df.describe()
| | duration_sec | start_station_id | start_station_latitude | start_station_longitude | end_station_id | end_station_latitude | end_station_longitude | bike_id | member_birth_year |
|---|---|---|---|---|---|---|---|---|---|
| count | 183412.000000 | 183215.000000 | 183412.000000 | 183412.000000 | 183215.000000 | 183412.000000 | 183412.000000 | 183412.000000 | 175147.000000 |
| mean | 726.078435 | 138.590427 | 37.771223 | -122.352664 | 136.249123 | 37.771427 | -122.352250 | 4472.906375 | 1984.806437 |
| std | 1794.389780 | 111.778864 | 0.099581 | 0.117097 | 111.515131 | 0.099490 | 0.116673 | 1664.383394 | 10.116689 |
| min | 61.000000 | 3.000000 | 37.317298 | -122.453704 | 3.000000 | 37.317298 | -122.453704 | 11.000000 | 1878.000000 |
| 25% | 325.000000 | 47.000000 | 37.770083 | -122.412408 | 44.000000 | 37.770407 | -122.411726 | 3777.000000 | 1980.000000 |
| 50% | 514.000000 | 104.000000 | 37.780760 | -122.398285 | 100.000000 | 37.781010 | -122.398279 | 4958.000000 | 1987.000000 |
| 75% | 796.000000 | 239.000000 | 37.797280 | -122.286533 | 235.000000 | 37.797320 | -122.288045 | 5502.000000 | 1992.000000 |
| max | 85444.000000 | 398.000000 | 37.880222 | -121.874119 | 398.000000 | 37.880222 | -121.874119 | 6645.000000 | 2001.000000 |
The dataset contains 16 columns and 183,412 rows. Some columns have the wrong data type: (1) start_time and end_time should be datetime, and (2) start_station_id, end_station_id, and member_birth_year should be int.
The most important feature, I think, is user_type. I would like to look for the reasons that lead a person to subscribe.
I will create a new column called distance to investigate whether it affects user_type. Other columns of interest are age and gender.
# count the null values in each column
df.isna().sum()
duration_sec                  0
start_time                    0
end_time                      0
start_station_id            197
start_station_name          197
start_station_latitude        0
start_station_longitude       0
end_station_id              197
end_station_name            197
end_station_latitude          0
end_station_longitude         0
bike_id                       0
user_type                     0
member_birth_year          8265
member_gender              8265
bike_share_for_all_trip       0
dtype: int64
# duplication check
df.duplicated().sum()
0
# drop rows with null values
df.dropna(inplace=True)
# Change data types to correct types
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])
df['start_station_id'] = df['start_station_id'].astype('int')
df['end_station_id'] = df['end_station_id'].astype('int')
df['member_birth_year'] = df['member_birth_year'].astype('int')
# Ref ==> https://towardsdatascience.com/calculating-distance-between-two-geolocations-in-python-26ad3afe287b
def distance(obs):
    # great-circle distance in meters between the start and end stations of one trip
    loc1 = (obs['start_station_latitude'], obs['start_station_longitude'])
    loc2 = (obs['end_station_latitude'], obs['end_station_longitude'])
    return hs.haversine(loc1, loc2, unit=Unit.METERS)
df['distance'] = df.apply(distance,axis=1)
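Applying a Python function row by row can be slow on roughly 175,000 trips. As an alternative, the haversine formula can be written directly with NumPy so that it works on whole columns at once. The sketch below is not the `haversine` package's implementation: the function name `haversine_m` and the constant `EARTH_RADIUS_M` are made up for illustration, and the radius value is an assumed mean Earth radius. The two sample trips are rows 0 and 4 of the raw table shown earlier.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6_371_008.8  # assumed mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters; accepts scalars, arrays, or Series."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


# Two trips copied from the raw data (rows 0 and 4).
trips = pd.DataFrame({
    "start_station_latitude": [37.789625, 37.804562],
    "start_station_longitude": [-122.400811, -122.271738],
    "end_station_latitude": [37.794231, 37.792714],
    "end_station_longitude": [-122.402923, -122.248780],
})

# One vectorized call over whole columns instead of one call per row.
trips["distance"] = haversine_m(
    trips["start_station_latitude"],
    trips["start_station_longitude"],
    trips["end_station_latitude"],
    trips["end_station_longitude"],
)
print(trips["distance"].round(1))
```

The results should agree with the row-wise version to within a meter or so; any small difference would come from the assumed radius.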
# convert duration_sec to minutes and drop duration_sec
df['duration_min'] = df['duration_sec'] / 60
df.drop('duration_sec',axis=1,inplace=True)
# create a new column called age (the data was collected in 2019)
df['age']=2019 - df['member_birth_year']
df.describe()
| | start_station_id | start_station_latitude | start_station_longitude | end_station_id | end_station_latitude | end_station_longitude | bike_id | member_birth_year | distance | duration_min | age |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 |
| mean | 139.002126 | 37.771220 | -122.351760 | 136.604486 | 37.771414 | -122.351335 | 4482.587555 | 1984.803135 | 1690.051442 | 11.733379 | 34.196865 |
| std | 111.648819 | 0.100391 | 0.117732 | 111.335635 | 0.100295 | 0.117294 | 1659.195937 | 10.118731 | 1096.958237 | 27.370082 | 10.118731 |
| min | 3.000000 | 37.317298 | -122.453704 | 3.000000 | 37.317298 | -122.453704 | 11.000000 | 1878.000000 | 0.000000 | 1.016667 | 18.000000 |
| 25% | 47.000000 | 37.770407 | -122.411901 | 44.000000 | 37.770407 | -122.411647 | 3799.000000 | 1980.000000 | 910.444462 | 5.383333 | 27.000000 |
| 50% | 104.000000 | 37.780760 | -122.398279 | 101.000000 | 37.781010 | -122.397437 | 4960.000000 | 1987.000000 | 1429.831313 | 8.500000 | 32.000000 |
| 75% | 239.000000 | 37.797320 | -122.283093 | 238.000000 | 37.797673 | -122.286533 | 5505.000000 | 1992.000000 | 2224.012981 | 13.150000 | 39.000000 |
| max | 398.000000 | 37.880222 | -121.874119 | 398.000000 | 37.880222 | -121.874119 | 6645.000000 | 2001.000000 | 69469.336637 | 1409.133333 | 141.000000 |
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 18 columns):
 #   Column                   Non-Null Count   Dtype
---  ------                   --------------   -----
 0   start_time               174952 non-null  datetime64[ns]
 1   end_time                 174952 non-null  datetime64[ns]
 2   start_station_id         174952 non-null  int32
 3   start_station_name       174952 non-null  object
 4   start_station_latitude   174952 non-null  float64
 5   start_station_longitude  174952 non-null  float64
 6   end_station_id           174952 non-null  int32
 7   end_station_name         174952 non-null  object
 8   end_station_latitude     174952 non-null  float64
 9   end_station_longitude    174952 non-null  float64
 10  bike_id                  174952 non-null  int64
 11  user_type                174952 non-null  object
 12  member_birth_year        174952 non-null  int32
 13  member_gender            174952 non-null  object
 14  bike_share_for_all_trip  174952 non-null  object
 15  distance                 174952 non-null  float64
 16  duration_min             174952 non-null  float64
 17  age                      174952 non-null  int32
dtypes: datetime64[ns](2), float64(6), int32(4), int64(1), object(5)
memory usage: 22.7+ MB
df.drop('member_birth_year',axis=1, inplace = True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 17 columns):
 #   Column                   Non-Null Count   Dtype
---  ------                   --------------   -----
 0   start_time               174952 non-null  datetime64[ns]
 1   end_time                 174952 non-null  datetime64[ns]
 2   start_station_id         174952 non-null  int32
 3   start_station_name       174952 non-null  object
 4   start_station_latitude   174952 non-null  float64
 5   start_station_longitude  174952 non-null  float64
 6   end_station_id           174952 non-null  int32
 7   end_station_name         174952 non-null  object
 8   end_station_latitude     174952 non-null  float64
 9   end_station_longitude    174952 non-null  float64
 10  bike_id                  174952 non-null  int64
 11  user_type                174952 non-null  object
 12  member_gender            174952 non-null  object
 13  bike_share_for_all_trip  174952 non-null  object
 14  distance                 174952 non-null  float64
 15  duration_min             174952 non-null  float64
 16  age                      174952 non-null  int32
dtypes: datetime64[ns](2), float64(6), int32(3), int64(1), object(5)
memory usage: 22.0+ MB
df.describe()
| | start_station_id | start_station_latitude | start_station_longitude | end_station_id | end_station_latitude | end_station_longitude | bike_id | distance | duration_min | age |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 |
| mean | 139.002126 | 37.771220 | -122.351760 | 136.604486 | 37.771414 | -122.351335 | 4482.587555 | 1690.051442 | 11.733379 | 34.196865 |
| std | 111.648819 | 0.100391 | 0.117732 | 111.335635 | 0.100295 | 0.117294 | 1659.195937 | 1096.958237 | 27.370082 | 10.118731 |
| min | 3.000000 | 37.317298 | -122.453704 | 3.000000 | 37.317298 | -122.453704 | 11.000000 | 0.000000 | 1.016667 | 18.000000 |
| 25% | 47.000000 | 37.770407 | -122.411901 | 44.000000 | 37.770407 | -122.411647 | 3799.000000 | 910.444462 | 5.383333 | 27.000000 |
| 50% | 104.000000 | 37.780760 | -122.398279 | 101.000000 | 37.781010 | -122.397437 | 4960.000000 | 1429.831313 | 8.500000 | 32.000000 |
| 75% | 239.000000 | 37.797320 | -122.283093 | 238.000000 | 37.797673 | -122.286533 | 5505.000000 | 2224.012981 | 13.150000 | 39.000000 |
| max | 398.000000 | 37.880222 | -121.874119 | 398.000000 | 37.880222 | -121.874119 | 6645.000000 | 69469.336637 | 1409.133333 | 141.000000 |
# create new columns for the day of the week and the hour of the day
df['day'] = df['start_time'].dt.weekday
df['hour'] = df['start_time'].dt.hour
# dt.weekday numbers the days from Monday = 0 to Sunday = 6
day_dic = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'}
df['day'] = df['day'].map(day_dic)
df.head(5)
| | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_gender | bike_share_for_all_trip | distance | duration_min | age | day | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-02-28 17:32:10.145 | 2019-03-01 08:01:55.975 | 21 | Montgomery St BART Station (Market St at 2nd St) | 37.789625 | -122.400811 | 13 | Commercial St at Montgomery St | 37.794231 | -122.402923 | 4902 | Customer | Male | No | 544.709256 | 869.750000 | 35 | Thursday | 17 |
| 2 | 2019-02-28 12:13:13.218 | 2019-03-01 05:24:08.146 | 86 | Market St at Dolores St | 37.769305 | -122.426826 | 3 | Powell St BART Station (Market St at 4th St) | 37.786375 | -122.404904 | 5905 | Customer | Male | No | 2704.548867 | 1030.900000 | 47 | Thursday | 12 |
| 3 | 2019-02-28 17:54:26.010 | 2019-03-01 04:02:36.842 | 375 | Grove St at Masonic Ave | 37.774836 | -122.446546 | 70 | Central Ave at Fell St | 37.773311 | -122.444293 | 6638 | Subscriber | Other | No | 260.738904 | 608.166667 | 30 | Thursday | 17 |
| 4 | 2019-02-28 23:54:18.549 | 2019-03-01 00:20:44.074 | 7 | Frank H Ogawa Plaza | 37.804562 | -122.271738 | 222 | 10th Ave at E 15th St | 37.792714 | -122.248780 | 4898 | Subscriber | Male | Yes | 2409.304744 | 26.416667 | 45 | Thursday | 23 |
| 5 | 2019-02-28 23:49:58.632 | 2019-03-01 00:19:51.760 | 93 | 4th St at Mission Bay Blvd S | 37.770407 | -122.391198 | 323 | Broadway at Kearny | 37.798014 | -122.405950 | 5200 | Subscriber | Male | No | 3332.207230 | 29.883333 | 60 | Thursday | 23 |
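Because pandas numbers weekdays from Monday = 0 to Sunday = 6, a hand-written lookup dictionary is easy to get out of step with the calendar. A minimal sketch of an alternative is `Series.dt.day_name()`, which returns the weekday name directly. The two timestamps below are start times copied from the dataset; the variable names are only illustrative.

```python
import pandas as pd

# Two start times taken from the data: 28 Feb 2019 and 1 Feb 2019.
start_time = pd.to_datetime(
    pd.Series(["2019-02-28 17:32:10.145", "2019-02-01 00:00:20.636"])
)

weekday_number = start_time.dt.weekday   # Monday = 0 ... Sunday = 6
weekday_name = start_time.dt.day_name()  # 'Monday' ... 'Sunday'

print(pd.DataFrame({"number": weekday_number, "name": weekday_name}))
```

Checking one or two dates against a calendar (28 February 2019 was a Thursday) is a quick way to confirm the labels.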
# remove gender that is Other
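# .copy() makes df an independent DataFrame, so adding columns later does not raise SettingWithCopyWarning
df = df[df['member_gender'] != 'Other'].copy()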
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 171305 entries, 0 to 183411
Data columns (total 19 columns):
 #   Column                   Non-Null Count   Dtype
---  ------                   --------------   -----
 0   start_time               171305 non-null  datetime64[ns]
 1   end_time                 171305 non-null  datetime64[ns]
 2   start_station_id         171305 non-null  int32
 3   start_station_name       171305 non-null  object
 4   start_station_latitude   171305 non-null  float64
 5   start_station_longitude  171305 non-null  float64
 6   end_station_id           171305 non-null  int32
 7   end_station_name         171305 non-null  object
 8   end_station_latitude     171305 non-null  float64
 9   end_station_longitude    171305 non-null  float64
 10  bike_id                  171305 non-null  int64
 11  user_type                171305 non-null  object
 12  member_gender            171305 non-null  object
 13  bike_share_for_all_trip  171305 non-null  object
 14  distance                 171305 non-null  float64
 15  duration_min             171305 non-null  float64
 16  age                      171305 non-null  int32
 17  day                      171305 non-null  object
 18  hour                     171305 non-null  int64
dtypes: datetime64[ns](2), float64(6), int32(3), int64(2), object(6)
memory usage: 24.2+ MB
# create a new column that shows the age group
def age_group(age):
    if 10 < age <= 20:
        return '11-20'
    elif 20 < age <= 30:
        return '21-30'
    elif 30 < age <= 40:
        return '31-40'
    elif 40 < age <= 50:
        return '41-50'
    elif 50 < age <= 60:
        return '51-60'
    else:
        return '>60'
df['age_group'] = df['age'].map(age_group)
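The same binning can be sketched with `pd.cut`, which assigns each age to a right-closed interval and returns an ordered categorical in one step. The ages below are illustrative values, not rows from the dataset. One behavioural difference from the function above: ages of 10 or below would become NaN here rather than falling into '>60' (the minimum age in this data is 18, so it does not matter in practice).

```python
import numpy as np
import pandas as pd

ages = pd.Series([18, 20, 21, 35, 60, 61, 141])  # illustrative ages only

bins = [10, 20, 30, 40, 50, 60, np.inf]
labels = ['11-20', '21-30', '31-40', '41-50', '51-60', '>60']

# Intervals are right-closed by default: (10, 20], (20, 30], ..., (60, inf]
age_groups = pd.cut(ages, bins=bins, labels=labels)
print(age_groups)
```

Since the result is already an ordered categorical, a separate `pd.Categorical(...)` step for plotting order would not be needed with this approach.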
df.describe()
| | start_station_id | start_station_latitude | start_station_longitude | end_station_id | end_station_latitude | end_station_longitude | bike_id | distance | duration_min | age | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 171305.00000 | 171305.000000 | 171305.000000 | 171305.000000 | 171305.000000 | 171305.000000 | 171305.000000 | 171305.000000 | 171305.000000 | 171305.000000 | 171305.000000 |
| mean | 138.70695 | 37.770629 | -122.351657 | 136.304889 | 37.770831 | -122.351225 | 4481.294136 | 1687.777172 | 11.629300 | 34.160649 | 13.451545 |
| std | 111.71479 | 0.101225 | 0.118522 | 111.421147 | 0.101130 | 0.118088 | 1659.524197 | 1094.411591 | 26.287562 | 10.116083 | 4.733722 |
| min | 3.00000 | 37.317298 | -122.453704 | 3.000000 | 37.317298 | -122.453704 | 11.000000 | 0.000000 | 1.016667 | 18.000000 | 0.000000 |
| 25% | 47.00000 | 37.770083 | -122.411901 | 44.000000 | 37.770407 | -122.411647 | 3796.000000 | 908.552902 | 5.366667 | 27.000000 | 9.000000 |
| 50% | 104.00000 | 37.780760 | -122.398279 | 100.000000 | 37.781010 | -122.397437 | 4960.000000 | 1428.327735 | 8.483333 | 32.000000 | 14.000000 |
| 75% | 239.00000 | 37.797280 | -122.283127 | 237.000000 | 37.797320 | -122.287610 | 5505.000000 | 2217.417808 | 13.116667 | 39.000000 | 17.000000 |
| max | 398.00000 | 37.880222 | -121.874119 | 398.000000 | 37.880222 | -121.874119 | 6645.000000 | 69469.336637 | 1409.133333 | 141.000000 | 23.000000 |
days_ordered=['Saturday','Sunday','Monday','Tuesday','Wednesday','Thursday','Friday']
df['days_ordered'] = pd.Categorical(df['day'], categories=days_ordered, ordered=True)
age_ord=['11-20','21-30', '31-40', '41-50', '51-60', '>60']
df['age_group_ord']=pd.Categorical(df['age_group'], categories=age_ord, ordered=True)
#Export clean data to csv
df.to_csv("201902-fordgobike-tripdata-clean.csv", index=False)
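One thing to keep in mind when exporting: CSV stores only text, so datetime and ordered-categorical dtypes are lost and have to be restored when the file is read back. The sketch below uses a tiny made-up frame and an in-memory buffer instead of the real file; the names `sample`, `plain`, and `restored` are hypothetical.

```python
import io
import pandas as pd

group_order = ["21-30", "31-40"]
sample = pd.DataFrame({
    "start_time": pd.to_datetime(["2019-02-28 17:32:10.145", "2019-02-01 00:00:20.636"]),
    "age_group": pd.Categorical(["31-40", "21-30"], categories=group_order, ordered=True),
})

buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# A plain read brings both columns back as strings.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Restoring the dtypes explicitly after reading.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["start_time"])
restored["age_group"] = pd.Categorical(
    restored["age_group"], categories=group_order, ordered=True
)
print(plain.dtypes, restored.dtypes, sep="\n\n")
```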
In this section, investigate distributions of individual variables. If you see unusual points or outliers, take a deeper look to clean things up and prepare yourself to look at relationships between variables.
plt.figure(figsize=[14,7])
fig=sns.barplot(x=df['user_type'].value_counts().index,y=df['user_type'].value_counts(),palette='mako')
fig.set(ylabel='Frequency',xlabel='User Type',title='Distribution of User Type');
As shown above, most users are subscribers.
plt.figure(figsize=[14,7])
fig=sns.barplot(x=df['member_gender'].value_counts().index,y=df['member_gender'].value_counts(),palette='mako')
fig.set(ylabel='Frequency',xlabel='Gender',title='Distribution of gender');
As shown above, most users are male.
df['day'].value_counts()
Thursday     32984
Tuesday      30022
Wednesday    27825
Friday       27083
Monday       25106
Sunday       14183
Saturday     14102
Name: day, dtype: int64
plt.figure(figsize=[14,7])
fig = sns.barplot(x=df['days_ordered'].value_counts().index, y= df['day'].value_counts(),palette='mako')
fig.set(xlabel='day', ylabel='Frequency',title='Number of trips each day' );
As shown above, Saturday and Sunday have far fewer trips than the weekdays.
plt.figure(figsize=[14,7])
fig = sns.barplot(x=df['hour'].value_counts().index, y= df['hour'].value_counts(),palette='mako');
fig.set(xlabel='Hour', ylabel='Frequency',title='Number of trips each Hour' );
As shown above, most trips start between 7:00 AM and 9:00 AM or between 4:00 PM and 6:00 PM.
plt.figure(figsize=[14,7])
# palette is ignored without a hue variable, so it is omitted here
fig = sns.histplot(data=df, x="age", kde=True, binwidth=2);
plt.title('Distribution of Age');
plt.figure(figsize=[14,7])
fig = sns.barplot(x=df['age_group_ord'].value_counts().index, y= df['age_group'].value_counts(),palette='mako')
fig.set(xlabel='Age Group', ylabel='Frequency', title='Age Group Frequency');
As shown above, most users are 40 or younger.
plt.figure(figsize=[14,7])
fig = sns.boxplot(y='age',data=df ,palette='mako')
fig.set( ylabel='User Age',title='Box Plot of User Age' );
As shown above, 75% of users are 39 or younger, and ages above roughly 60 are drawn as outliers. I do not treat them as errors, because it is plausible for someone aged around 70 to use this service.
plt.figure(figsize=[14,7])
fig = sns.boxplot(y='distance',data=df ,palette='mako')
fig.set(ylabel='Distance (m)', title='Trip distance');
As shown above, the distance column contains outliers.
plt.figure(figsize=[14,7])
# combine both conditions in one boolean mask instead of chaining two filters
plt.hist(df.loc[(df['distance'] > 0) & (df['distance'] < 2000), 'distance'], edgecolor='black', linewidth=1)
plt.title('Distribution of distance');
As shown above, most trip distances are between 750 and 1500 meters.
plt.figure(figsize=[14,7])
fig = sns.barplot(x=df['bike_share_for_all_trip'].value_counts().index,y= df['bike_share_for_all_trip'].value_counts(),palette='mako')
fig.set(ylabel='Frequency', title='Frequency of Bike Share for All Trips')
plt.show()
As shown above, most trips are not Bike Share for All trips.
plt.figure(figsize=[14,7])
fig = sns.boxplot(y='duration_min',data=df ,palette='mako')
fig.set(ylabel='Duration (min)', title='Box Plot of Trip Duration');
plt.figure(figsize=[14,7])
# palette is ignored without a hue variable, so it is omitted here
fig = sns.histplot(data=df[df['duration_min'] < 100], x="duration_min", kde=True, binwidth=2);
plt.title('Distribution of Trip Duration');
As shown above, most trips last less than 20 minutes.
For gender, there was an 'Other' value, which I dropped. For user type, there are no unusual points. For distance, most trips are between 750 and 1500 meters. The notable finding is that most trips take place on weekdays rather than weekends, and most of them start at hours 8, 9, 18, and 19.
Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
I found outliers in a couple of features:
- Age: outliers from about 60 to 80
- Distance: outliers at 0 or above 3000 m
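A box plot marks as outliers the points beyond 1.5 times the interquartile range (IQR) from the quartiles. As a rough sketch, the same fences can be computed numerically. The helper name `iqr_fences` is made up, and the tiny series is constructed only so that its quartiles match the age quartiles reported earlier (27 and 39).

```python
import pandas as pd


def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences used by the usual box-plot rule."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy series whose quartiles are 27 and 39, like the age column.
ages = pd.Series([27, 27, 39, 39])
low, high = iqr_fences(ages)
print(low, high)
```

With quartiles of 27 and 39 the upper fence is 57, which is in line with ages from about 60 upward being drawn as outliers.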
In this section, investigate relationships between pairs of variables in your data. Make sure the variables that you cover here have been introduced in some fashion in the previous section (univariate exploration).
plt.figure(figsize=[14,7])
fig = sns.barplot(x=df['days_ordered'], y= df['distance'],palette='mako')
fig.set(xlabel='Day', ylabel='Mean distance (m)', title='Distance By Day of Week');
As shown above, there is no correlation between the day of the week and distance.
plt.figure(figsize=[14,7])
fig = sns.barplot(x=df['hour'], y= df['distance'],palette='mako')
fig.set(xlabel='Hour', ylabel='Mean distance (m)', title='Distance By Hour');
As shown above, trips that start at 5, 6, 7, 8, or 9 AM cover longer distances.
plt.figure(figsize=[14,7])
# use the ordered categorical so the age groups appear in order on the x-axis
fig = sns.barplot(y=df['duration_min'], x=df['age_group_ord'], palette='mako')
fig.set(xlabel='Age Group', ylabel='Mean duration (min)', title='Duration By Age Group');
As shown above, members over 60 have the longest average trip duration.
plt.figure(figsize=[14,7])
# plot distance here, to match the title
fig = sns.barplot(y=df['distance'], x=df['age_group_ord'], palette='mako')
fig.set(xlabel='Age Group', ylabel='Mean distance (m)', title='Distance By Age Group');
plt.figure(figsize=[14,7])
fig = sns.countplot(data=df,x='user_type',hue='member_gender',palette='mako')
fig.set(xlabel='User Type', ylabel='Frequency', title='User Type By Gender');
As shown above, most subscribers are male.
plt.figure(figsize=[14,7])
fig = sns.countplot(data=df,x='user_type',hue='age_group',palette='mako')
fig.set(xlabel='User Type', ylabel='Frequency', title='User Type By Age Group');
As shown above, subscribers and customers follow the same pattern across age groups.
plt.figure(figsize=[14,7])
fig = sns.countplot(data=df,x='day',hue='user_type',palette='mako')
fig.set(xlabel='day', ylabel='Frequency',title='Day By User Type' );
As shown above, most subscribers used the service on weekdays.
Looking at user types, I found that most customers used the bike service on weekends, while subscribers used it on weekdays. I also found that most subscribers are male.
Most subscribers are male, and they used the service on weekdays.
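Because there are far more subscribers than customers, raw counts can hide differences in when each group rides. A row-normalized cross-tabulation compares shares instead. The sketch below uses a small made-up frame (`toy`) rather than the real data, and the `day_kind` column is a hypothetical helper.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "user_type": ["Subscriber", "Subscriber", "Subscriber", "Customer", "Customer", "Subscriber"],
    "start_time": pd.to_datetime([
        "2019-02-01",  # Friday
        "2019-02-04",  # Monday
        "2019-02-05",  # Tuesday
        "2019-02-02",  # Saturday
        "2019-02-06",  # Wednesday
        "2019-02-03",  # Sunday
    ]),
})

# dt.weekday: Monday = 0 ... Sunday = 6, so 5 and 6 are the weekend.
toy["day_kind"] = np.where(toy["start_time"].dt.weekday >= 5, "weekend", "weekday")

# normalize="index" turns each row into shares that sum to 1.
share = pd.crosstab(toy["user_type"], toy["day_kind"], normalize="index")
print(share)
```

On the real data, comparing the weekend share for each user type would put a number on the weekday versus weekend pattern described above.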
Create plots of three or more variables to investigate your data even further. Make sure that your investigations are justified, and follow from your work in the previous sections.
plt.figure(figsize=[14,7])
sns.scatterplot( data=df[df['distance']<2000],x="duration_min", y="distance", hue="age_group")
plt.title("correlation between Duration and Distance and Age Group")
plt.ylabel('Distance ')
plt.xlabel('Duration ')
plt.show()
As shown above, there is a positive correlation between trip duration and age group: the 21-30 group shows a positive correlation between distance and duration.
plt.figure(figsize=[14,7])
sns.scatterplot( data=df[df['distance']<2000],x="duration_min", y="distance", hue="user_type")
plt.title("correlation between Duration and Distance and User Type",size = 18)
plt.ylabel('Distance ')
plt.xlabel('Duration ')
plt.show()
As shown above, there is no correlation between trip duration and user type.
plt.figure(figsize=[14,7])
# numeric_only=True restricts the correlation matrix to numeric columns
sns.heatmap(df.corr(numeric_only=True), annot=True, linewidth=.6);
As the plot above shows, the location variables (station IDs, latitudes, and longitudes) are strongly correlated with each other, which is natural. Otherwise there is no strong or moderate correlation between the variables; most coefficients are below 0.1.
I found a positive correlation between trip duration and age group: the 21-30 group shows a positive correlation between distance and duration.
I expected user type to be positively correlated with distance and duration, with subscribers riding longer and farther, but the visualizations show no such relationship.
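A scatter plot with many overlapping points can make group differences hard to judge by eye. A grouped summary gives a numeric check. The sketch below uses made-up values in a toy frame purely to show the call; it says nothing about the real result.

```python
import pandas as pd

# Made-up trips for illustration only.
toy = pd.DataFrame({
    "user_type": ["Subscriber", "Subscriber", "Customer", "Customer"],
    "distance": [900.0, 1500.0, 1000.0, 2000.0],
    "duration_min": [5.0, 9.0, 12.0, 20.0],
})

# Median distance and duration per user type.
summary = toy.groupby("user_type")[["distance", "duration_min"]].median()
print(summary)
```

The median is used here rather than the mean because both columns have long right tails in this dataset.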
The majority of trips on the bike-sharing system cover between 750 and 1500 meters, with a significant portion of users traveling over 1,000 meters. There is no correlation between the day of the week and the distance traveled, and most trips are not Bike Share for All trips. Saturdays and Sundays have fewer trips than the other days, and the majority of users are subscribers. Members over 60 have longer trip durations on average. The busiest times for trips are the commuting hours of 6 AM to 9 AM and 4 PM to 6 PM, with Thursday being the busiest day of the week. The median trip duration is about 8.5 minutes and the mean about 11.6 minutes, and the median trip distance is about 1.4 km.